class: center, middle, inverse, title-slide

.title[
# Transformation of Covariates
]
.subtitle[
## Lecture 8
]
.author[
### Manuel Villarreal
]
.date[
### 09/30/24
]

---

### Linear Splines

- In the previous lecture we mentioned that when we use a linear spline we are basically joining two or more independent lines at some point.

--

- Let's call that point a joint and denote it by the variable `\(c_k\)`, where `\(k\)` represents the joint number.

--

- Formally, for every joint `\(c_k\)` in the spline (point where two lines join), we want to make sure that the following is true:
  `$$\lim_{X_i\to c_k^{-}} \mathrm{E}\left(Y_i \mid f(X_i)\right) = \lim_{X_i\to c_k^{+}} \mathrm{E}\left(Y_i \mid f(X_i)\right)$$`

--

- This means that we want the expected value of our dependent variable `\(Y\)` to be the same whether we approach the joint `\(c_k\)` from the negative side of `\(X_i\)` or from the positive side.

---

### Linear Splines

- In other words, we want our expectations to transition smoothly from one line to the other without any abrupt jumps.

--

- Let's check our example from the previous class and see if this condition is met.

--

- Formally, our model states that:
  `$$\mathrm{E}\left(Y_i \mid f(X_i)\right) = \beta_0 + \beta_1\left(I(X_i > 40)(X_i - 40)\right)$$`

--

- Where `\(I(X_i > 40)\)` is an indicator function that takes the value `\(0\)` if `\(X_i\leq 40\)` and the value `\(1\)` otherwise.

--

- Notice that in this case `\(f(X_i) = I(X_i > 40)(X_i - 40)\)`.

---

### Linear Splines: continuity

- Now we can check what the limit of our function is.

--

- Let's start by approaching the joint from the negative side.
  Given that the joint is at `\(c_1 = 40\)` we have that:
  `$$\lim_{X_i\to c_k^{-}} \mathrm{E}\left(Y_i \mid f(X_i)\right) = \lim_{X_i\to c_k^{-}}\left( \beta_0 + \beta_1\left(I(X_i > 40)(X_i - 40)\right) \right)\\ \lim_{X_i\to 40^{-}} \mathrm{E}\left(Y_i \mid f(X_i)\right) = \beta_0 + \beta_1\left(\lim_{X_i\to 40^{-}}I(X_i > 40)(X_i - 40)\right)$$`

---

### Linear Splines: continuity

- According to the definition of a limit, we can get arbitrarily close to the value `\(c_k\)` (in this case `\(40\)`) from the left, but we never reach it. Every value on that side satisfies `\(X_i < 40\)`, so `\(I(X_i > 40) = 0\)`. In other words we have that:
  `$$\lim_{X_i\to 40^{-}} \mathrm{E}\left(Y_i \mid f(X_i)\right) = \beta_0 + \beta_1\left(0\cdot\lim_{X_i\to 40^{-}}(X_i - 40)\right)$$`

--

- The limit `\(\lim_{X_i\to 40^{-}}(X_i - 40)\)` is being multiplied by `\(0\)`. Therefore we know that, regardless of the value of `\(X_i\)`, the equation can be simplified as:
  `$$\lim_{X_i\to 40^{-}} \mathrm{E}\left(Y_i \mid f(X_i)\right) = \beta_0$$`

--

- This means that for people who are `\(40\)` years old or younger we expect their memory performance to be `\(\beta_0\)`.

---

### Linear Splines: continuity

- Now let's look at the limit from the positive side:
  `$$\lim_{X_i\to c_k^{+}} \mathrm{E}\left(Y_i \mid f(X_i)\right) = \lim_{X_i\to c_k^{+}}\left( \beta_0 + \beta_1\left(I(X_i > 40)(X_i - 40)\right) \right)\\ \lim_{X_i\to 40^{+}} \mathrm{E}\left(Y_i \mid f(X_i)\right) = \beta_0 + \beta_1\left(\lim_{X_i\to 40^{+}}I(X_i > 40)(X_i - 40)\right)$$`

--

- Now we are approaching the joint `\(c_k\)` (which is still `\(40\)`) from the positive side of the real line, which means that we can get arbitrarily close to 40 without reaching that value. Therefore, we know that `\(I(X_i > 40) = 1\)`.
  `$$\lim_{X_i\to 40^{+}} \mathrm{E}\left(Y_i \mid f(X_i)\right) = \beta_0 + \beta_1\left(\lim_{X_i\to 40^{+}}1\cdot (X_i - 40)\right)$$`

---

### Linear Splines: continuity

- This time there is no zero to multiply the limit; however, we can think about what happens as we get arbitrarily close to `\(40\)` from the positive side.

--

- We want to figure out the value of `\(\lim_{X_i\to 40^{+}}(X_i - 40)\)`. As `\(X_i\)` moves closer and closer to 40, the difference `\(X_i - 40\)` gets closer and closer to 0!

--

- In other words, as `\(X_i\)` moves arbitrarily close to 40 the difference moves arbitrarily close to 0, so the limit is 0, which means that
  `$$\lim_{X_i\to 40^{+}} \mathrm{E}\left(Y_i \mid f(X_i)\right) = \beta_0 + \beta_1\left(1\cdot 0\right) = \beta_0$$`

--

- And with this we have shown that both expectations are equal!
  `$$\lim_{X_i\to c_k^{-}} \mathrm{E}\left(Y_i \mid f(X_i)\right) = \beta_0 = \lim_{X_i\to c_k^{+}} \mathrm{E}\left(Y_i \mid f(X_i)\right)$$`

---

### More than two lines

- What does this mean in practice? Well, it means that we have to make sure that two consecutive lines have the same value at their joint.

--

- Now that we know this condition, we can think of adding more lines to a model.

--

- Let's go back to our average temperature example and model the data from the city of Miami using linear splines.

---

### Spline: Average temperature

- Open the file `in-class-09.Rmd`, load the `cities.csv` data, and use the `subset()` function to keep only the data for the city of Miami.

--

- We can use the piping operator together with the `subset()` function to keep only the data for the city of Miami:

``` r
# Keep only the rows that belong to Miami
cities <- cities |>
  subset(subset = city == "miami")
```

---

### Spline: Average temperature

- Now we can make a box plot of the temperature in Miami by month.
--

<img src="data:image/png;base64,#lecture-08_files/figure-html/unnamed-chunk-2-1.png" width="40%" style="display: block; margin: auto;" />

---

### Spline: Average temperature

- Let's start with a linear model that includes 2 joints and 3 lines.

--

- We can set one joint at the third month and a second joint at the 8th month.

--

- What does this mean? It means that temperature should have a consistent rate of change `\((\beta)\)` from January to March.

--

- A different but also consistent rate of change from March to August.

--

- And finally, a different rate of change from August to December.

--

- How can we do this and still maintain the continuity assumption in our model?

--

- Well, we have to use a combination of indicator and difference functions!

---

### Spline: Average temperature

- For example, we know that the average temperature should have an intercept different from `\(0\)` to start.

--

- Then it might be the case that the temperature increases slightly from January to March.

--

- We can achieve this by multiplying the month by an indicator function that assigns a value of 1 to the months from January to March and a 0 to any other month.

--

- Mathematically we can express this function as:
  `$$\mathrm{E}\left(Y_i \mid f_1(\text{month}_i)\right) = \beta_0 + \beta_1 \text{month}_i + \beta_2I(1 \leq \text{month}_i \leq 3) (\text{month}_i)$$`

--

- This first part of the equation makes sure that the line that connects the months of January and March has a single slope and intercept.

---

### Spline: Average temperature

- Add a variable to your data set that assigns the value of `\(\text{month}_i\)` to every observation made in the months of January to March and 0 otherwise (question 3 `in-class-09.Rmd`).

---

### Spline: Average temperature

``` r
# f_1: the month itself for January to March, 0 for every other month
cities <- cities |>
  dplyr::mutate("id_jan_mar" = ifelse(test = month <= 3, yes = month, no = 0))
```
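---

### Spline: Average temperature

As a quick check of what this variable contains, the sketch below applies the same `ifelse()` rule to a plain vector of the 12 months. It uses base R only, and the standalone `month` vector is an assumption made for illustration (it is not the `cities` data).

``` r
# Stand-alone vector of month numbers (an assumption for illustration;
# in the lecture the months come from the `cities` data).
month <- 1:12

# Same rule as in the lecture: keep the month for January to March, 0 otherwise.
id_jan_mar <- ifelse(test = month <= 3, yes = month, no = 0)

# Show each month next to the value of the new variable.
print(data.frame(month, id_jan_mar))
```

Only the first three months carry a non-zero value, so the coefficient that multiplies this variable can only affect the expected value from January to March.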
---

### Spline: Average temperature

- With this new function we have that:
  `$$\mathrm{E}\left(Y_i \mid f_1(\text{month}_i = 1)\right) = \beta_0 + \beta_1 + \beta_2\\ \mathrm{E}\left(Y_i \mid f_1(\text{month}_i = 2)\right) = \beta_0 + 2\beta_1+ 2\beta_2\\ \mathrm{E}\left(Y_i \mid f_1(\text{month}_i = 3)\right) = \beta_0 + 3\beta_1 + 3\beta_2$$`

--

- In this case the parameter `\(\beta_1\)` represents the rate of change in average temperature in the city of Miami as a function of the month.

--

- The sum `\(\beta_1 + \beta_2\)` represents the rate of change in temperature for a one-month change between the months of January and March. The interpretation of `\(\beta_0\)` depends on how we handle the rest of the lines in the model!

---

### Spline: Average temperature

- Now we have to decide which month will represent the next joint.

--

- From the data it looks like the average temperature increases at a constant rate between March and August.

--

- That means that we can have a new line that starts at March and goes all the way to August.

--

- However, remember that we have to satisfy the continuity assumption, which means that we want to find a function that makes the following equation true:
  `$$\lim_{\text{month}_i\to \text{3}^{-}} \mathrm{E}\left(Y_i \mid f_1(\text{month}_i), f_2(\text{month}_i)\right) =\\ \lim_{\text{month}_i\to \text{3}^{+}} \mathrm{E}\left(Y_i \mid f_1(\text{month}_i), f_2(\text{month}_i)\right)$$`

---

### Spline: Average temperature

- Once again let's use an indicator function and a difference function to generate our new variable `\(f_2(\text{month}_i)\)`:
  `$$f_2(\text{month}_i) = I(3 < \text{month}_i \leq 8)(\text{month}_i - 3)$$`

--

- This function satisfies the continuity condition at the joint: it equals 0 for every month up to March and approaches 0 as the month approaches 3 from above.

--

- Add this function as a new variable to your data set with the name "id_mar_aug" (question 4 `in-class-09.Rmd`).
--

- Like before, we can use the `mutate()` function alongside the `ifelse()` function to generate our new variable:

``` r
# f_2: months elapsed since March for April to August, 0 otherwise
cities <- cities |>
  dplyr::mutate("id_mar_aug" = ifelse(test = month > 3 & month <= 8,
                                      yes = month - 3, no = 0))
```

---

### Spline: Average temperature

- Finally, we need a third function that will allow us to set a different slope `\((\beta)\)` for the months of September to December.

--

- Once again we can use an indicator function and a difference function in order to make sure that we meet the continuity assumption.

--

- Add a new variable to your data set named "id_aug_dec" that combines an indicator function and a difference function to assign values to the months of September, October, November, and December, so that the continuity assumption is met (question 5 `in-class-09.Rmd`).

--

- Like we did before, we will use the `mutate()` and `ifelse()` functions:

``` r
# f_3: months elapsed since August for September to December, 0 otherwise
cities <- cities |>
  dplyr::mutate("id_aug_dec" = ifelse(test = month > 8 & month <= 12,
                                      yes = month - 8, no = 0))
```

---

### Spline: Average temperature

- Now we can fit the following model to the data (question 6 `in-class-09.Rmd`):
  `$$Y_i = \beta_0 + \beta_1\text{month}_i + \beta_2f_1(\text{month}_i) + \beta_3f_2(\text{month}_i) + \beta_4f_3(\text{month}_i) + \epsilon_i$$`

--

- Remember that the three functions are the three new variables we created, so we can fit this model with:

``` r
lm_spline_miami <- lm(
  formula = avgtemp ~ month + id_jan_mar + id_mar_aug + id_aug_dec,
  data = cities)
```

---

### Plot: expected average temperature

- Make a plot of the average temperature of Miami by month and add the estimated expected average temperature from our new model (question 7 `in-class-09.Rmd`).
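--

One way such a plot could be built is sketched below. It is a minimal, self-contained illustration in base R: the `toy` data frame and its coefficients are hypothetical stand-ins (not the Miami observations), and the variables are created with `ifelse()` directly instead of `dplyr::mutate()`.

``` r
# Toy data (hypothetical values, not the Miami observations): three
# observations per month generated from known coefficients.
toy <- data.frame(month = rep(1:12, each = 3))
toy$id_jan_mar <- ifelse(toy$month <= 3, toy$month, 0)
toy$id_mar_aug <- ifelse(toy$month > 3 & toy$month <= 8, toy$month - 3, 0)
toy$id_aug_dec <- ifelse(toy$month > 8 & toy$month <= 12, toy$month - 8, 0)
noise <- rep(c(-0.5, 0, 0.5), times = 12)  # averages to 0 within each month
toy$avgtemp <- 60 + 1 * toy$month + 0.5 * toy$id_jan_mar +
  2 * toy$id_mar_aug - 3 * toy$id_aug_dec + noise

# Same model formula as in the lecture, fitted to the toy data.
fit <- lm(avgtemp ~ month + id_jan_mar + id_mar_aug + id_aug_dec, data = toy)

# One row per month with its spline variables, used to get the expected values.
grid <- unique(toy[, c("month", "id_jan_mar", "id_mar_aug", "id_aug_dec")])
expected <- predict(fit, newdata = grid)

# Box plot of the data by month with the estimated expected value on top.
boxplot(avgtemp ~ month, data = toy,
        xlab = "Month", ylab = "Average temperature")
lines(x = grid$month, y = expected, col = "red", lwd = 2)
```

Because the toy noise averages to zero within every month, the fitted coefficients match the ones used to generate the data, which makes it easy to confirm that the line is drawn from the right variables.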
--

<img src="data:image/png;base64,#lecture-08_files/figure-html/unnamed-chunk-8-1.png" width="40%" style="display: block; margin: auto;" />

---

### Parameter interpretation

- `\(\beta_0\)` is the estimated average temperature for the city of Miami at month 0; as in other models we have used, the intercept is not easy to interpret.

--

- `\(\beta_1\)` is the estimated change in the average temperature for the city of Miami for a one-month difference.

--

- `\(\beta_1 + \beta_2\)` is the estimated change in the average temperature for the city of Miami for a one-month difference between the months of January and March.

--

- `\(\beta_1 + \beta_3\)` is the estimated change in the average temperature for the city of Miami for a one-month difference between the months of March and August.

- `\(\beta_1 + \beta_4\)` is the estimated change in the average temperature for the city of Miami for a one-month difference between the months of August and December.

---

### Residuals

- Like we have done before, we can look at the residuals of the model to check if there is a bias in the estimated average temperature.

--

- Plot the residuals of the model that assumes that the average monthly temperature in the city of Miami has a different rate of change between the months of January to March, March to August, and August to December (question 8 `in-class-09.Rmd`).

--

<img src="data:image/png;base64,#lecture-08_files/figure-html/unnamed-chunk-9-1.png" width="33%" style="display: block; margin: auto;" />

---

### The city of San Francisco

- Notice that when we started looking at the data from the city of Miami we made some assumptions about the position of the joints `\(c_1\)` and `\(c_2\)`. But how good would these assumptions be for a different city?

--

- Load the data for the city of San Francisco in a separate object and add to it the same 3 variables that we added to the data from Miami (question 9 `in-class-09.Rmd`).
--

``` r
cities <- readr::read_csv(file = here::here("week-05/data/cities.csv")) |>
  subset(subset = city == "san_francisco") |>
  dplyr::mutate(
    "id_jan_mar" = ifelse(test = month <= 3, yes = month, no = 0),
    "id_mar_aug" = ifelse(test = month > 3 & month <= 8, yes = month - 3, no = 0),
    "id_aug_dec" = ifelse(test = month > 8 & month <= 12, yes = month - 8, no = 0))
```

---

### The city of San Francisco

- Fit a linear model using these 3 new variables to the average temperature of the city of San Francisco (question 10 `in-class-09.Rmd`).

--

``` r
lm_spline_sfn <- lm(
  formula = avgtemp ~ month + id_jan_mar + id_mar_aug + id_aug_dec,
  data = cities)
```

---

### The city of San Francisco

- Make a plot of the estimated average temperature in the city of San Francisco as a function of the month according to the linear model we just fit to the data (question 11 `in-class-09.Rmd`).

--

<img src="data:image/png;base64,#lecture-08_files/figure-html/unnamed-chunk-12-1.png" width="40%" style="display: block; margin: auto;" />

---

### Residuals

- Now that we have applied the same model to both cities we can look at the residuals for the city of San Francisco (question 12 `in-class-09.Rmd`).

--

<img src="data:image/png;base64,#lecture-08_files/figure-html/unnamed-chunk-13-1.png" width="504" style="display: block; margin: auto;" />

---

### Linear Splines

- Plotting the estimated expected value of the average temperature of each city against the data allowed us to see that:

--

1. The assumptions we made about the city of Miami work relatively well, as there is no obvious difference between the line and the median of the data.

--

1. The same assumptions applied to the city of San Francisco showed us that a model with more joints could be better, as there are some months where the estimated mean is far from the median of the data.
--

- It is important to note that for the city of San Francisco we could have found better locations for the two joints than the two points that we used for the city of Miami.

--

- Those joints were chosen by looking at the data from Miami first. We have no reason to believe that the same joints would work equally well across cities.

---

### Linear Splines

- The problem we would like to solve is: How can we find the best number of joints and their position for a given data set?

--

- There are statistical methods built from the idea of maximizing the probability of the data (Maximum Likelihood Estimation) that aim to solve this problem.

--

- These methods allow us to choose the number and position of the joints for a linear spline. Additionally, methods similar to the BIC allow us to penalize models that add more joints than are warranted by the data.

--

- In other words, the arbitrary choices that we made for this example can be made using a more systematic approach.

--

- As a final note, notice that there is no reason why we can't combine the splines approach with the non-linear transformations that we used to account for the average temperature data.
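
---

### Linear Splines

To give a flavor of what a more systematic choice could look like, the sketch below runs a brute-force search over candidate joint locations and compares the fits with `BIC()`. This is only an illustration under stated assumptions, not the estimation method referred to above: the data are a hypothetical toy example with a single true joint at 6, only integer joints are tried, and each candidate model uses the `\(I(X_i > c)(X_i - c)\)` term from the first example.

``` r
# Toy data (hypothetical): one true joint at x = 6, three observations per x.
x <- rep(1:12, each = 3)
noise <- rep(c(-0.05, 0, 0.05), times = 12)  # averages to 0 within each x
y <- 10 + 2 * x - 3 * ifelse(x > 6, x - 6, 0) + noise

# Fit a one-joint linear spline for a given candidate joint location.
fit_one_joint <- function(joint) {
  after_joint <- ifelse(x > joint, x - joint, 0)  # I(x > c)(x - c)
  lm(y ~ x + after_joint)
}

# Compute the BIC of the one-joint model for every candidate location.
candidates <- 2:11
bic_by_joint <- sapply(candidates, function(joint) BIC(fit_one_joint(joint)))
best_joint <- candidates[which.min(bic_by_joint)]

# Compare against a single straight line (a model with no joints).
bic_line <- BIC(lm(y ~ x))

print(data.frame(joint = candidates, bic = bic_by_joint))
print(c(best_joint = best_joint, bic_best = min(bic_by_joint), bic_line = bic_line))
```

A lower BIC is better. Every one-joint model here has the same number of parameters, so the search simply picks the joint with the smallest residual error, while the comparison with the straight line shows how the BIC weighs the extra parameter against the improvement in fit.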